class: center, middle, inverse, title-slide

# Introduction to Survey Data Cleaning Using Tidyverse in R
## Introduction
### Johannes Breuer, Stefan Jünger
### 2021-07-22

---
layout: true

<div class="my-footer">
  <div style="float: left;"><span>Johannes Breuer, Stefan Jünger</span></div>
  <div style="float: right;"><span>ESRA 2021, 2021-07-22</span></div>
  <div style="text-align: center;"><span>Introduction</span></div>
</div>

---

## About us

### Johannes Breuer

.small[
- Senior researcher in the team Data Augmentation, Department Survey Data Curation, [*GESIS - Leibniz Institute for the Social Sciences*](https://www.gesis.org/en/home), Cologne, Germany
- (Co-)Leader of the team Research Data & Methods at the [*Center for Advanced Internet Studies*](https://www.cais.nrw/en/center-for-advanced-internet-studies-cais-en/) (CAIS), Bochum, Germany
- Main areas:
  - digital trace data for social science research
  - data linking (surveys + digital trace data)
- Ph.D. in Psychology, University of Cologne
- Previously worked in several research projects investigating the use and effects of digital media (Cologne, Hohenheim, Münster, Tübingen)
- Other research interests:
  - Computational methods
  - Data management
  - Open science

[johannes.breuer@gesis.org](mailto:johannes.breuer@gesis.org) | [@MattEagle09](https://twitter.com/MattEagle09) | [personal website](https://www.johannesbreuer.com/)
]

---

## About us

### Stefan Jünger

.pull-left[
<img src="data:image/png;base64,#C:\Users\mueller2\talks_presentations\tidyverse-workshop-esra-2021\content\img\stefan.png" width="50%" style="display: block; margin: auto;" />
]

.pull-right[
- Postdoctoral researcher in the team Data Augmentation at the GESIS department Survey Data Curation
- Ph.D. in social sciences, University of Cologne
]

- Research interests:
  - quantitative methods & Geographic Information Systems (GIS)
  - social inequalities & attitudes towards minorities
  - data management & data privacy
  - reproducible research

.small[
[stefan.juenger@gesis.org](mailto:stefan.juenger@gesis.org) | [@StefanJuenger](https://twitter.com/StefanJuenger) | [https://stefanjuenger.github.io](https://stefanjuenger.github.io)
]

---

## About you

Please use the text chat to introduce yourself:

- What's your name?
- Where do you work?
- What do you work on?
- What are your experiences with `R` and the `tidyverse`?
- What are your motivations for joining this course? What are your expectations for this course?

---

## Prerequisites for this course

.large[
- Working versions of `R` and *RStudio*
- Some basic knowledge of `R`
- The `tidyverse` packages
]

---

## Workshop Structure & Materials

- The workshop consists of a combination of short lectures and hands-on exercises
- Slides and other materials are available at

.center[`https://github.com/jobreu/tidyverse-workshop-esra-2021`]

---

## Course schedule

<table>
 <thead>
  <tr>
   <th style="text-align:center;"> When? </th>
   <th style="text-align:center;"> What?
   </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;"> 13:00 - 13:20 </td>
   <td style="text-align:center;"> Introduction: Welcome to the tidyverse </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 13:20 - 13:30 </td>
   <td style="text-align:center;"> Exercise 1 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 13:30 - 13:45 </td>
   <td style="text-align:center;"> Data Import </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 13:45 - 14:00 </td>
   <td style="text-align:center;"> Exercise 2 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 14:00 - 14:30 </td>
   <td style="text-align:center;"> Data Wrangling - Part 1 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 14:30 - 14:45 </td>
   <td style="text-align:center;"> Exercise 3 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 14:45 - 15:00 </td>
   <td style="text-align:center;"> <i>Coffee break</i> </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 15:00 - 15:30 </td>
   <td style="text-align:center;"> Data Wrangling - Part 2 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 15:30 - 15:45 </td>
   <td style="text-align:center;"> Exercise 4 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 15:45 - 16:00 </td>
   <td style="text-align:center;"> Wrap-Up </td>
  </tr>
</tbody>
</table>

---

## Online format

- If possible, we invite you to turn on your camera
- If you have an immediate question during the lecture parts, please send it via text chat
  - Public or private (ideally to the person currently not presenting if you want an immediate response)
- If you have a question that is not urgent and might be interesting for everybody, you can also use audio (& video) to ask it during the exercise parts
- We would also kindly ask you to mute your microphones when you are not asking (or answering) a question

---

## What is the `tidyverse`?

> The `tidyverse` is an .highlight[opinionated collection of R packages designed for data science].
> All packages share an .highlight[underlying design philosophy, grammar, and data structures] ([Tidyverse website](https://www.tidyverse.org/)).

> The `tidyverse` is a .highlight[coherent system of packages for data manipulation, exploration and visualization] that share a .highlight[common design philosophy] ([Rickert, 2017](https://rviews.rstudio.com/2017/06/08/what-is-the-tidyverse/)).

<img src="data:image/png;base64,#C:\Users\mueller2\talks_presentations\tidyverse-workshop-esra-2021\content\img\hex-tidyverse.png" width="25%" style="display: block; margin: auto;" />

---

## Benefits of the `tidyverse`

.large[
Most of the things we are going to show you can also be achieved with base `R`. However, the base `R` syntax is typically more verbose and less intuitive and, hence, harder to learn, remember, and read. In addition, many `tidyverse` operations are faster than their base `R` equivalents.
]

---

## Benefits of the `tidyverse`

.large[
`Tidyverse` syntax is designed to increase **human-readability**. This makes it especially **attractive for `R` novices** as it can facilitate the experience of **self-efficacy** (see [Robinson, 2017](http://varianceexplained.org/r/teach-tidyverse/)).

The `tidyverse` also aims for **consistency** (e.g., data frame as first argument and output) and uses **smarter defaults** (e.g., no partial matching of data frame and column names).
]

---

## `tidyverse` for `R` beginners

<img src="data:image/png;base64,#C:\Users\mueller2\talks_presentations\tidyverse-workshop-esra-2021\content\img\DistractedBf.png" width="75%" style="display: block; margin: auto;" />

---

## Workflow

.center[
<img src="data:image/png;base64,#C:\Users\mueller2\talks_presentations\tidyverse-workshop-esra-2021\content\img\data-science.png" width="60%" style="display: block; margin: auto;" />
]

<small><small>Source: http://r4ds.had.co.nz/</small></small>

.highlight[- **Import**: read in data in different formats (e.g., .csv, .xls, .sav, .dta)
- **Tidy**: clean data (1 row = 1 case, 1 column = 1 variable), rename & recode variables, etc.
- **Transform**: prepare data for analysis (e.g., by aggregating and/or filtering)]
- **Visualize**: explore/analyze data through informative plots
- **Model**: analyze the data by creating models (e.g., a linear regression model)
- **Communicate**: present the results (to others)

---

## `Tidyverse` workflow

.center[
<img src="data:image/png;base64,#C:\Users\mueller2\talks_presentations\tidyverse-workshop-esra-2021\content\img\tidyverse-1200x484.png" width="1600" style="display: block; margin: auto;" />
]

<small><small>Source: http://www.storybench.org/getting-started-with-tidyverse-in-r/</small></small>

---

## Lift-off into the `tidyverse` 🚀

**Install all `tidyverse` packages** (for the full list of `tidyverse` packages see [https://www.tidyverse.org/packages/](https://www.tidyverse.org/packages/))

```r
install.packages("tidyverse")
```

**Load core `tidyverse` packages** (NB: To save time and reduce namespace conflicts, it can make sense to load the `tidyverse` packages individually)

```r
library("tidyverse")
```

---

## `tidyverse` vocab 101

We will focus on three key things here:

1. Tidy data
2. Tibbles
3. Pipes

---

## Tidy data

The 3 rules of tidy data:

1. Each **variable** is in a separate **column**.
2. Each **observation** is in a separate **row**.
3. Each **value** is in a separate **cell**.
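
As a minimal sketch of what these rules mean in practice, the following example tidies a small, entirely made-up data set. The object and column names (`untidy`, `age_gender`, `trust_w1`, `trust_w2`) are hypothetical and not part of the workshop data, and the example assumes that the `tibble` and `tidyr` packages are installed.

```r
library(tibble)
library(tidyr)  # also makes the pipe operator %>% available

# Hypothetical mini survey (all values are made up for illustration):
# age and gender share one cell, and the trust scores from two waves
# are spread over two columns.
untidy <- tibble(
  id         = c(1, 2, 3),
  age_gender = c("34_f", "51_m", "27_f"),
  trust_w1   = c(4, 2, 5),
  trust_w2   = c(3, 2, 4)
)

tidy <- untidy %>%
  # Rule 3: one value per cell, so split age and gender into two columns
  separate(age_gender, into = c("age", "gender"), sep = "_", convert = TRUE) %>%
  # Rules 1 and 2: one column per variable, one row per observation
  # (here: one row per respondent and wave)
  pivot_longer(
    cols         = starts_with("trust_w"),
    names_to     = "wave",
    names_prefix = "trust_w",
    values_to    = "trust"
  )

tidy
```

In this sketch, `tidy` has one row per respondent and wave, with `id`, `age`, `gender`, `wave`, and `trust` each stored in a column of their own.
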
<img src="data:image/png;base64,#C:\Users\mueller2\talks_presentations\tidyverse-workshop-esra-2021\content\img\tidy_data.png" width="2560" style="display: block; margin: auto;" />

Source: https://r4ds.had.co.nz/tidy-data.html

*NB*: In `tidyverse` terminology, 'tidy data' usually also means data in long format (where applicable).

---

## Wide vs. long format

<img src="data:image/png;base64,#C:\Users\mueller2\talks_presentations\tidyverse-workshop-esra-2021\content\img\wide-long.png" width="90%" style="display: block; margin: auto;" />

Source: https://github.com/gadenbuie/tidyexplain#tidy-data

---

## Tibbles

.pull-left[
Tibbles are basically just `R` `data.frame`s, but nicer:

- only the first ten observations are printed by default
- output is tidier!
- you get some additional metadata about rows and columns that you would normally only get when using `dim()` and other functions

You can check the [tibble vignette](https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html) for technical details.
]

.pull-right[
<img src="data:image/png;base64,#C:\Users\mueller2\talks_presentations\tidyverse-workshop-esra-2021\content\img\tibble.png" width="60%" style="display: block; margin: auto;" />
]

---

## A `data.frame`

.small[
```
##                Murder Assault UrbanPop Rape
## Alabama          13.2     236       58 21.2
## Alaska           10.0     263       48 44.5
## Arizona           8.1     294       80 31.0
## Arkansas          8.8     190       50 19.5
## California        9.0     276       91 40.6
## Colorado          7.9     204       78 38.7
## Connecticut       3.3     110       77 11.1
## Delaware          5.9     238       72 15.8
## Florida          15.4     335       80 31.9
## Georgia          17.4     211       60 25.8
## Hawaii            5.3      46       83 20.2
## Idaho             2.6     120       54 14.2
## Illinois         10.4     249       83 24.0
## Indiana           7.2     113       65 21.0
## Iowa              2.2      56       57 11.3
## Kansas            6.0     115       66 18.0
## Kentucky          9.7     109       52 16.3
## Louisiana        15.4     249       66 22.2
## Maine             2.1      83       51  7.8
## Maryland         11.3     300       67 27.8
## Massachusetts     4.4     149       85 16.3
## Michigan         12.1     255       74 35.1
## Minnesota         2.7      72       66 14.9
## Mississippi      16.1     259       44 17.1
## Missouri          9.0     178       70 28.2
## Montana           6.0     109       53 16.4
## Nebraska          4.3     102       62 16.5
## Nevada           12.2     252       81 46.0
## New Hampshire     2.1      57       56  9.5
## New Jersey        7.4     159       89 18.8
## New Mexico       11.4     285       70 32.1
## New York         11.1     254       86 26.1
## North Carolina   13.0     337       45 16.1
## North Dakota      0.8      45       44  7.3
## Ohio              7.3     120       75 21.4
## Oklahoma          6.6     151       68 20.0
## Oregon            4.9     159       67 29.3
## Pennsylvania      6.3     106       72 14.9
## Rhode Island      3.4     174       87  8.3
## South Carolina   14.4     279       48 22.5
## South Dakota      3.8      86       45 12.8
## Tennessee        13.2     188       59 26.9
## Texas            12.7     201       80 25.5
## Utah              3.2     120       80 22.9
## Vermont           2.2      48       32 11.2
## Virginia          8.5     156       63 20.7
## Washington        4.0     145       73 26.2
## West Virginia     5.7      81       39  9.3
## Wisconsin         2.6      53       66 10.8
## Wyoming           6.8     161       60 15.6
```
]

---

## A `tibble`

.small[
```
## # A tibble: 50 x 4
##    Murder Assault UrbanPop  Rape
##     <dbl>   <int>    <int> <dbl>
##  1   13.2     236       58  21.2
##  2   10       263       48  44.5
##  3    8.1     294       80  31
##  4    8.8     190       50  19.5
##  5    9       276       91  40.6
##  6    7.9     204       78  38.7
##  7    3.3     110       77  11.1
##  8    5.9     238       72  15.8
##  9   15.4     335       80  31.9
## 10   17.4     211       60  25.8
## # ... with 40 more rows
```
]

---

## Converting dataframes into tibbles

You can convert any `data.frame` into a `tibble`:

```r
data("USArrests")

tibble::as_tibble(USArrests)
```

.small[
```
## # A tibble: 50 x 4
##    Murder Assault UrbanPop  Rape
##     <dbl>   <int>    <int> <dbl>
##  1   13.2     236       58  21.2
##  2   10       263       48  44.5
##  3    8.1     294       80  31
##  4    8.8     190       50  19.5
##  5    9       276       91  40.6
##  6    7.9     204       78  38.7
##  7    3.3     110       77  11.1
##  8    5.9     238       72  15.8
##  9   15.4     335       80  31.9
## 10   17.4     211       60  25.8
## # ... with 40 more rows
```
]

---

## The logic of pipes

Usually, in `R` we apply functions as follows:

```r
f(x)
```

In the logic of pipes this function is written as:

```r
x %>% f(.)
```

--

We can use pipes on more than one function:

```r
x %>%
  f_1() %>%
  f_2() %>%
  f_3()
```

More details: https://r4ds.had.co.nz/pipes.html

---

## Pipes everywhere...

```r
library(memer)

meme_get("OprahGiveaway") %>%
  meme_text_bottom("EVERYONE GETS A %>%!!!", size = 36)
```

<img src="data:image/png;base64,#C:\Users\mueller2\talks_presentations\tidyverse-workshop-esra-2021\content\img\OprahGiveaway.png" width="60%" style="display: block; margin: auto;" />

---

## Resources

There are hundreds of tutorials, courses, blog posts, etc. about the `tidyverse` available online.

The book [*R for Data Science*](https://r4ds.had.co.nz/) by [Hadley Wickham](http://hadley.nz/) and [Garrett Grolemund](https://twitter.com/statgarrett) (which is available for free online) provides a very comprehensive introduction to the `tidyverse`.

The weekly [Tidy Tuesday](https://github.com/rfordatascience/tidytuesday) data projects and the associated [#tidytuesday Twitter hashtag](https://twitter.com/hashtag/tidytuesday?lang=en) are also a fun way of learning and practicing data wrangling and exploration with the `tidyverse`.

---

## Cheatsheets

*RStudio* offers a good collection of [cheatsheets for R](https://www.rstudio.com/resources/cheatsheets/).
The following two are of particular interest for this workshop:

- [Data Import Cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf)
- [Data Transformation Cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf)

---
class: center, middle

# [Exercise](https://jobreu.github.io/tidyverse-workshop-esra-2021/exercises/Exercise_1.html) time 🏋️‍♀️💪🏃🚴

## [Solutions](https://jobreu.github.io/tidyverse-workshop-esra-2021/solutions/Exercise_1.html)
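
---

## Pipes: a small worked example

To recap the logic of pipes with concrete functions, here is a minimal sketch that computes the same result once with nested calls and once with a pipe chain. The vector `x` is made up for illustration. The sketch loads `%>%` from the `magrittr` package, which is an assumption made to keep it self-contained; the operator is also available after `library("tidyverse")`.

```r
library(magrittr)  # provides %>%

# Made-up example values
x <- c(1, 4, 9, 16)

# Nested function calls are read from the inside out ...
nested <- round(mean(sqrt(x)), 1)

# ... while the piped version is read from left to right
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

# Both versions should give the same result
identical(nested, piped)
```

Each step of the chain passes its result on as the first argument of the next function, so the order of the code matches the order of the operations.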